
Support WDL call caching #5105

Merged: adamnovak merged 48 commits into master from issues/4797-shared-fs-caching on Oct 28, 2024

Conversation

adamnovak (Member)

This adds support for WDL call caching compatible with MiniWDL, controlled by the same configuration/environment variables that MiniWDL uses. This should fix #4797.

It caches only task calls, plus the results of the write_* WDL functions, which is necessary for caching calls that depend on workflow-generated files.

When saving to the cache, it copies all output files from a Toil step into the MiniWDL cache folder, since Toil doesn't usually generate a persistent copy of task output files.

It does not cache downloads the way MiniWDL can, so tasks that depend on URL-based files probably can't be cached (or might get stuck cached and not update when the URL content changes).

It doesn't do anything special for String-to-File coercion, but probably should.

If you have this workflow:

version 1.1

workflow TestWorkflow {
    input {
    }
    call TestTask {
        input:
    }
    call WCTask as count_listing {
        input:
            to_count = TestTask.dirlist
    }
    call WCTask as count_dynamic {
        input:
            to_count = TestTask.trouble
    }
    call WCTask as count_workflow_scope {
        input:
            to_count = write_lines(["At", "workflow", "scope"])
    }
    output {
        Array[File] counts = [count_listing.result, count_dynamic.result, count_workflow_scope.result]
    }
}

task TestTask {
    input {
    }
    command <<<
        ls -l
    >>>
    output {
        Int number = 1
        File dirlist = stdout()
        File trouble = write_lines(["Hello", "World"])
    }
    runtime {
        container: "ubuntu:latest"
        cpu: 1
    }
}

task WCTask {
    input {
        File to_count
    }
    command <<<
        wc -l ~{to_count}
    >>>
    output {
        File result = stdout()
    }
    runtime {
        container: "ubuntu:latest"
        cpu: 1
    }
}

Then you can run it with either of:

MINIWDL__CALL_CACHE__GET=true MINIWDL__CALL_CACHE__PUT=true MINIWDL__CALL_CACHE__DIR=`pwd`/miniwdl-cache miniwdl run test.wdl

Or

MINIWDL__CALL_CACHE__GET=true MINIWDL__CALL_CACHE__PUT=true MINIWDL__CALL_CACHE__DIR=`pwd`/miniwdl-cache toil-wdl-runner --logDebug --jobStore ./tree test.wdl

And the second one will get cache hits for all the tasks that the first one ran.

Changelog Entry

To be copied to the draft changelog by merger:

  • Toil now supports reading and writing MiniWDL's call cache.

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak (Member Author)

I need to make this also support caching calls to whole workflows. If I use the files in the output tasks' caches instead of fetching the cached results of one workflow, I get different paths (that symlink to the same files), and then I can't re-use MiniWDL cached calls that depend on files from the called workflow.

@adamnovak (Member Author)

I think we might run into problems with sibling files with WDL call caching when reading caches written by MiniWDL. MiniWDL lays out task outputs by output name, and then references those locations in the cache files. So files that were siblings won't be siblings anymore when loaded from MiniWDL's task output directories.

I think we might just have to let that not work. When Toil writes files into the cache, it should obey the sibling-files constraint, because it uses the same path-picking code as when it writes files for a task.

@adamnovak (Member Author)

https://ucsc-ci.com/databiosphere/toil/-/jobs/79122 got Cannot assign requested address errors trying to do an HTTP download in src/toil/test/wdl/wdltoil_test.py::WDLTests::test_miniwdl_self_test_by_reference. I don't think it has anything to do with that test specifically, except that it needs to do networking. We might have a problem with network connectivity on the GitLab runners, in addition to the problem that led us to assign a static IP to the main GitLab server.

@stxue1 mentioned this pull request on Oct 1, 2024
@DailyDreaming (Member) left a comment

This looks alright to me. Is there a way a test could be added?

@@ -267,7 +268,7 @@ def _runStep(self):
 if self.checkOnJobs():
     activity = True
 if not activity:
-    logger.debug('No activity, sleeping for %is', self.boss.sleepSeconds())
+    logger.log(TRACE, 'No activity, sleeping for %is', self.boss.sleepSeconds())
Member:

Should this go back to debug?

Member Author:

No, I lowered this to TRACE since I didn't think it made sense to log every second at debug level.
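For context, a custom level below DEBUG like the TRACE used in the diff can be registered with Python's standard logging module. This is only a sketch of the general technique; the numeric value chosen here is an assumption for illustration, not necessarily the one Toil defines.

```python
import logging

# Register a TRACE level below DEBUG (10). The value 5 is an assumption
# for illustration; any positive value under DEBUG works.
TRACE = logging.DEBUG - 5
logging.addLevelName(TRACE, "TRACE")

logger = logging.getLogger("example")
# At the default effective level (WARNING), this per-second message is
# suppressed; it only appears when the logger is set to TRACE or lower.
logger.log(TRACE, "No activity, sleeping for %is", 1)
```

Keeping the polling message at a level below DEBUG means `--logDebug` runs stay readable while the message remains available for deep debugging.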

@@ -49,14 +49,14 @@ def print_dot_chart(self) -> None:

 # Make job IDs to node names map
 jobsToNodeNames: Dict[str, str] = dict(
-    map(lambda job: (str(job.jobStoreID), job.jobName), self.jobsToReport)
+    map(lambda job: (str(job.jobStoreID), str(job.jobStoreID).replace("_", "___").replace("/", "_").replace("-", "__")), self.jobsToReport)
Member:

Hmmm... name_/something seems equivalent to name/_something after conversion. Is that alright?

Member Author:

Oh good point, I didn't think about adjacent sequences.
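The collision the reviewer spotted can be demonstrated directly. This reproduces the substitution chain from the diff above and shows two distinct IDs escaping to the same node name:

```python
def escape_node_name(name: str) -> str:
    # The substitution chain from the diff above, applied left to right:
    # "_" -> "___", then "/" -> "_", then "-" -> "__".
    return name.replace("_", "___").replace("/", "_").replace("-", "__")

# Adjacent separator sequences collide: "_/" becomes "___" + "_" and
# "/_" becomes "_" + "___", so both inputs map to four underscores.
a = escape_node_name("name_/something")  # "name____something"
b = escape_node_name("name/_something")  # "name____something"
assert a == b  # two different job store IDs, one DOT node name
```

A common fix for this class of bug is to escape the escape character itself first with a terminated code (e.g. `_` to `_u_` and `/` to `_s_`), so substitutions can't run together ambiguously.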

    imported = file_dest.import_file(candidate_uri, check_existence=False)
else:
    # The file store import_file doesn't do an existence check.
    # TODO: Have a more compatible interface.
Member:

In what way?

@adamnovak (Member Author)

@stxue1 Can you maybe look at this? I tried to merge my code that tacks a shared filesystem path onto every File that has one with your code that defers virtualization to task boundaries, but I know I'm missing at least one route that files can come in, because I don't think I am managing to add the shared filesystem path when a workflow-level string is coerced to a file and then virtualized. Where is that happening, and at that point do we have a good way to distinguish a leader-filesystem file from a worker-filesystem file?

Also, I went through and turned all the setattr file mutations and file.value = mutations into immutable replacements of the file with the version that has the attribute set/has its value set, because I was not confident that I understood the implications of changing the original File object. Is this approach going to work?

@stxue1 (Contributor)

stxue1 commented Oct 8, 2024

> @stxue1 Can you maybe look at this? I tried to merge my code that tacks a shared filesystem path onto every File that has one with your code that defers virtualization to task boundaries, but I know I'm missing at least one route that files can come in, because I don't think I am managing to add the shared filesystem path when a workflow-level string is coerced to a file and then virtualized. Where is that happening, and at that point do we have a good way to distinguish a leader-filesystem file from a worker-filesystem file?

The virtualization of a workflow-level string coerced to a file takes place in two areas. On task boundaries, virtualization is handled when virtualize_files() is called manually. The other place is when _read/_write is called, which happens when a string value is used implicitly in a standard library function that accepts files. Both ultimately point back to the _virtualize_file function, so I believe for optimal coverage caching should go right before this function is called.

I think the two places that matter are the virtualize_files call at the task boundary in WDLTaskWrapperJob (I think a file can be coerced from a string here, but I will have to double check), and the _read and _write calls in the standard library. I think it would be the _read call in ToilWDLStdLibBase and the _write call in ToilWDLStdLibTaskCommand, though my immediate hunch is that the TaskCommand stdlib shouldn't be touched, as that is more for MiniWDL internally (I left a docstring explaining why they differ; if it isn't clear I can come up with a better docstring). It seems like the new ToilWDLStdLibWorkflow._write does properly handle the virtualization and adds the caching system along with it. I think the place to check whether caching is needed/supported is ToilWDLStdLibBase, which still seems to be used in certain areas.
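A minimal sketch of why a single choke point covers both routes described above (all names here are hypothetical, not Toil's actual internals): the task-boundary virtualize_files() pass and the implicit _read/_write calls both funnel into one function, so a cache lookup placed there sees every file.

```python
from typing import Callable, Dict

# Hypothetical memo table: local path -> virtualized URI. In this sketch
# it stands in for whatever bookkeeping a _virtualize_file-style function
# would consult before uploading.
_VIRTUALIZED: Dict[str, str] = {}

def virtualize_file(path: str, upload: Callable[[str], str]) -> str:
    """Upload a file at most once, no matter which route requested it."""
    if path not in _VIRTUALIZED:
        # Cache miss: upload (e.g. to the job store) and remember the URI.
        _VIRTUALIZED[path] = upload(path)
    # Cache hit: reuse the existing virtualized reference.
    return _VIRTUALIZED[path]
```

Because both entry points call the same function, hooking caching here avoids having to instrument each call site separately.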

> Also, I went through and turned all the setattr file mutations and file.value = mutations into immutable replacements of the file with the version that has the attribute set/has its value set, because I was not confident that I understood the implications of changing the original File object. Is this approach going to work?

The immutable replacements should be fine; the metadata they copy over includes the necessary information for virtualization.

@stxue1 (Contributor)

stxue1 commented Oct 8, 2024

> at that point do we have a good way to distinguish a leader-filesystem file from a worker-filesystem file?

I suppose there isn't currently a very good way to distinguish whether file.value is pointing to a worker-filesystem file or a leader-filesystem file. Any use of a file will automatically try to read from the virtualized value, creating it if it doesn't exist (and thus uploading it to the job store), then changing the file.value field to the corresponding devirtualized path. This behavior is left over from before, when we read virtualized values that were set in the file.value field. Perhaps we should change it so the file.value field always points to a path that exists on the associated worker/leader.

My main concern is figuring out where exactly to do this without breaking path substitution support. In theory, I think at the beginning of every job should be fine.
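The suggestion above, normalizing each File's value at the start of every job so it always names a path that exists on the current node, could be sketched like this (function and parameter names are hypothetical, not Toil's API):

```python
import os
from typing import Callable

def normalize_file_value(value: str, devirtualize: Callable[[str], str]) -> str:
    """Hypothetical per-job normalization of a File's value field.

    If the value already resolves on this node's filesystem, use it
    directly; otherwise devirtualize it (fetch from the job store) so the
    value always points at a locally readable path.
    """
    if os.path.exists(value):
        return value              # already usable on this node
    return devirtualize(value)    # fetch to a local path on this node
```

Running this at the start of each job would make the leader/worker distinction moot for downstream code, at the cost of a stat call per file, though path substitution support would still need care, as noted above.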

)

# Apply the shared filesystem path to the virtualized file
set_shared_fs_path(virtualized_file, exported_path)
Contributor:

Missing an assignment?

@adamnovak (Member Author)

Steven's File Facts

The original string value is usually stored as the File value, unless the File is virtualized and sent to a task command job and the task tries to use it; in that case, inside the task, it will have a value pointing to a copy from the file store. If the file is passed through the task at the WDL level, it has the original string in its value when it gets back to workflow scope.

In the task wrapper the original string path is used.

JSON-based inputs get stored as file:/usr/lib/whatever but string coercions get stored as /usr/lib/whatever.

Strings coerced inside task output sections appear as toilfile: URIs at workflow scope.
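The three observed forms of a File's value can be told apart by URI scheme. This classifier is purely illustrative (it is not Toil code; the labels are this sketch's own), but it shows that the distinction reduces to scheme parsing:

```python
from urllib.parse import urlsplit

def classify_file_value(value: str) -> str:
    """Hypothetical classifier for the three forms noted above."""
    scheme = urlsplit(value).scheme
    if scheme == "toilfile":
        # e.g. strings coerced inside task output sections
        return "virtualized file store reference"
    if scheme == "file":
        # e.g. file:/usr/lib/whatever from JSON-based inputs
        return "JSON-based input (file: URI)"
    if scheme == "":
        # e.g. /usr/lib/whatever from string coercions
        return "string coercion (bare path)"
    return "other URI scheme: " + scheme
```

Whether a bare path refers to the leader's or a worker's filesystem is exactly the ambiguity discussed earlier in this thread; the scheme alone can't resolve that.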

@adamnovak merged commit 59d5c86 into master on Oct 28, 2024
1 check passed
@adamnovak deleted the issues/4797-shared-fs-caching branch on October 28, 2024 at 20:29